These lessons were designed to build on and reinforce each other. It is possible to write purely linear code, but purrr makes the code easier to read, remember, and reuse. Each csv can be read and examined in isolation, but rbind and dplyr make it possible to aggregate the files and discover trends. What follows are the steps to replicate the discovery of one particular problem in the mock data: excessively good R2 values.
For this lesson, we're fudging slightly by already knowing what to look for.
workingDir <- file.path(rootDir,"class_data")
# read every xlsx batch file in the directory and row-bind into one tibble
oneYearBatches <- list.files(workingDir,pattern="xlsx$") %>%
file.path(workingDir,.) %>%
map_dfr(read.xlsx) %>%
as_tibble() %>%
type_convert() # re-parse character columns (e.g. timestamps) into proper types
## Parsed with column specification:
## cols(
## batchName = col_character(),
## instrumentName = col_character(),
## compoundName = col_character(),
## reviewerName = col_character(),
## batchCollectedTimestamp = col_datetime(format = ""),
## reviewStartTimestamp = col_datetime(format = ""),
## reviewCompleteTimestamp = col_datetime(format = "")
## )
ggplot(oneYearBatches, aes(x=calibrationR2,color=compoundName)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(oneYearBatches, aes(x=batchCollectedTimestamp,y=calibrationR2,color=compoundName)) +
geom_line()
## Warning in as.POSIXlt.POSIXct(x): unknown timezone 'zone/tz/2017c.1.0/
## zoneinfo/America/Indiana/Indianapolis'
There’s something interesting going on with the R2 values in the month of May, where a large number of them report a value of 1.0 – a perfect fit. Let’s focus on that month, and spread out the data so we can clarify whether it’s all compounds or just oxymorphone (the magenta color on top).
mayPlot <- oneYearBatches %>%
filter(batchCollectedTimestamp > ymd("2017-04-15"), batchCollectedTimestamp < ymd("2017-06-15")) %>%
ggplot(aes(x=batchCollectedTimestamp,y=calibrationR2,color=compoundName))
mayPlot +
geom_line() +
facet_wrap(~ compoundName)
mayPlot +
geom_point() +
facet_grid(reviewerName ~ instrumentName)
Whatever is going on, it appears to be happening only to reviewer 'Dave'.
Based on the batch-level data, we can see that ‘Dave’ – and apparently only Dave – has perfect R2 values on every batch of data he reviewed throughout the month of May. Digging deeper will require merging information from the batch level with information at the sample (and possibly peak) level.
oneYearSamples <- list.files(workingDir,pattern="csv$") %>%
file.path(workingDir,.) %>%
map_dfr(read_csv)
davesData <- oneYearSamples %>%
left_join(select(oneYearBatches,-calibrationSlope,-calibrationIntercept)) %>%
filter(batchCollectedTimestamp > ymd("2017-04-20"),
batchCollectedTimestamp < ymd("2017-06-10"),
sampleType == "standard",
reviewerName == "Dave")
The following plots of davesData provide compelling evidence for what happened: Dave deselected the middle five calibrators in order to draw a straight line and maximize the R2 value.
davesData %>% ggplot(aes(x=batchCollectedTimestamp,y=usedForCurve,color=compoundName)) +
geom_point() +
facet_grid(compoundName ~ expectedConcentration) +
geom_vline(xintercept = as.numeric(as_datetime(c("2017-05-01","2017-06-01"))),linetype=1,colour="black")
davesData %<>% mutate(pctDiff = (concentration - expectedConcentration) / expectedConcentration,
within20 = abs(pctDiff) <= 0.2)
davesData %>% filter(compoundName == "codeine") %>%
ggplot(aes(x=batchCollectedTimestamp,y=pctDiff,color=within20)) +
geom_point() +
facet_wrap(~ expectedConcentration) +
ggtitle("Codeine Only") +
geom_vline(xintercept = as.numeric(as_datetime(c("2017-05-01","2017-06-01"))),linetype=1,colour="black")
## Warning: Removed 164 rows containing missing values (geom_point).
The second plot shows that calibrators were dropped regardless of whether they fell within 20% of the expected concentration, suggesting that they were dropped for some other reason. The data do not say why 'Dave' did this, but there are a couple of plausible explanations, most of which revolve around training.
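To put a number on that visual impression, a quick cross-tabulation works as well as a plot. This is a sketch only: it assumes the davesData tibble with the pctDiff and within20 columns constructed above, and the column names come from the lesson's mock data.

```r
# Sketch: count Dave's codeine calibrators by whether they were kept on the
# curve and whether they would have passed the 20% accuracy criterion.
# If many dropped points have within20 == TRUE, accuracy was not the reason.
davesData %>%
  filter(compoundName == "codeine") %>%
  count(usedForCurve, within20)
```

A large count in the usedForCurve == FALSE, within20 == TRUE cell would confirm that passing calibrators were being deselected.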
We intentionally included several other issues within the database, which will require aggregation and plotting to discover.
Exercise: revealing an ion ratio problem
Ion ratios are particularly sensitive to instrument conditions, because the secondary fragmentation pattern requires a highly tuned spectrometer and the selection of two ions which don't change much when the mass spec fluctuates. We can look for both outlier spikes and stability trends, and separate those out across instruments, compounds, and sample types. There is one obviously outlying instrument, which is worth investigating later. Excluding that instrument, which compound+instrument shows an outlier pattern? Is there a problem here? What additional variables would help besides the ones captured in the sample data.frame?
# 1 #
oneYearSamples %>%
left_join(oneYearBatches) %>%
ggplot(aes(x=batchCollectedTimestamp,y=ionRatio,color=compoundName)) +
geom_smooth() +
facet_grid(compoundName ~ instrumentName) # doc swapped Quant and Qual
## Joining, by = c("batchName", "compoundName")
## `geom_smooth()` using method = 'gam'
# 2 #
oneYearSamples %>%
left_join(oneYearBatches) %>%
filter(instrumentName!="doc") %>%
ggplot(aes(x=batchCollectedTimestamp,y=ionRatio,color=compoundName)) +
geom_smooth() +
facet_grid(compoundName ~ instrumentName) # grumpy+oxycodone looks least like the others
## Joining, by = c("batchName", "compoundName")
## `geom_smooth()` using method = 'gam'
# 3 #
oneYearSamples %>%
left_join(oneYearBatches) %>%
filter(compoundName=="oxycodone") %>%
#filter(instrumentName=="grumpy" & compoundName=="oxycodone") %>%
ggplot(aes(x=batchCollectedTimestamp,y=ionRatio,color=instrumentName)) +
geom_smooth() # grumpy+oxycodone clearly outlying
## Joining, by = c("batchName", "compoundName")
## `geom_smooth()` using method = 'gam'
# 4 #
# and now we need to box-whisker peakArea Quant and peakArea Qual for oxycodone, by month, and by week
# and this should reveal that Quant didn't change but Qual did
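One way to sketch that comparison, under stated assumptions: the peak-area column names (quantPeakArea, qualPeakArea) are guesses to be replaced with the real names in oneYearSamples, and the monthly binning uses lubridate's floor_date.

```r
# Sketch: box-whisker the quant and qual peak areas for oxycodone by month.
# Column names quantPeakArea / qualPeakArea are assumptions; substitute the
# actual sample-level columns. Change "month" to "week" for the weekly view.
oneYearSamples %>%
  left_join(oneYearBatches) %>%
  filter(compoundName == "oxycodone") %>%
  mutate(month = floor_date(batchCollectedTimestamp, "month")) %>%
  gather(peakType, peakArea, quantPeakArea, qualPeakArea) %>%
  ggplot(aes(x = factor(month), y = peakArea)) +
  geom_boxplot() +
  facet_wrap(~ peakType, scales = "free_y")
```

If the hypothesis holds, the quant panel should look stable across months while the qual panel shifts, which is what drives the ion ratio change on that instrument.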